Tom Augspurger, one of the maintainers of Python's Pandas library for data analysis, has an awesome series of blog posts on writing idiomatic Pandas code. In fact, you should probably leave this site now and go read one of those posts; they're really good. His post on Performance has an especially interesting tip:
"You rarely want to use DataFrame.apply and almost never should use it with axis=1 [which processes the DataFrame row-by-row, "across columns"]. Better to write functions that take arrays and pass those in directly..."
In Tom's example, he has a function built from numpy math calls, and he shows that it runs dramatically faster when those numpy functions are passed entire columns as arguments, which they can process as vectors. Using .apply(), on the other hand, calls those functions on one number at a time, in a loop.
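To make that concrete, here's a tiny sketch of the kind of comparison Tom describes -- the DataFrame and the distance function below are made-up stand-ins, not his actual example:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': np.random.rand(100000),
                    'b': np.random.rand(100000)})

def distance(x, y):
    # works the same whether x and y are two numbers or two whole columns
    return np.sqrt(x**2 + y**2)

# row-by-row: the numpy calls run once per row, inside a Python-level loop
slow = toy.apply(lambda row: distance(row['a'], row['b']), axis=1)

# vectorized: the same function gets whole columns and operates on arrays directly
fast = distance(toy['a'], toy['b'])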
It certainly makes sense that the vectorized approach --passing whole DataFrame columns to a function which accepts array(s) as input-- should provide a significant speedup for functions built on numpy math calls, since those can operate on arrays/vectors directly. But what about text processing?
Here, I'll take a simple text-processing function I've used with .apply() before, and compare its performance with a slightly modified version meant to accept whole DataFrame columns instead of single strings.
First, let's load our dataset - the Quora Duplicate Questions dataset released earlier this year.
In [2]:
import pandas as pd
df = pd.read_csv('datasets/quora_kaggle.csv')
df.head(3)
Out[2]:
The function I'll be testing is a simple text-processing function for tokenizing a string - returning the string as a list of words, after doing a bit of preprocessing.
In [3]:
import re
from nltk.corpus import stopwords
def tokenize(text):
    ''' Accept a string, return list of words (lowercased) without punctuation or stopwords'''
    # lowercase everything
    text = text.lower()
    # remove punctuation (r"\W" is a regex that matches any non-word character)
    text = re.sub(r"\W", " ", text)
    # return list of words, without stopwords (stopwords are very common words which may not convey much info)
    droplist = stopwords.words('english')
    return [word for word in text.split() if word not in droplist]
tokenize('This is a sentence. And another one with punctuation and special characters to strip!?*&^%')
Out[3]:
It takes about 49 seconds to apply this function to all of our 'question1' questions using .apply():
In [4]:
from datetime import datetime
start = datetime.now()
df['q1_tokenized'] = df['question1'].apply(tokenize)
print('Time elapsed: ', datetime.now() - start, '\n')
print(df[['question1', 'q1_tokenized']].head(3))
Time elapsed:  0:00:49.371585

(That's the middle time of three runs from a shell session; run-to-run times there were more consistent than in the notebook.)
Let's see if we can speed this up by modifying our tokenize function to accept a Pandas Series of strings, instead of a single string. That way we won't have to use .apply().
In [5]:
def tokenize2(text_series):
    ''' Accept a Series of strings, return a Series of word lists (lowercased) without punctuation or stopwords'''
    # lowercase everything
    text_series = text_series.str.lower()
    # remove punctuation (r'\W' matches any non-word character; regex=True treats the pattern as a regex)
    text_series = text_series.str.replace(r'\W', ' ', regex=True)
    # return list of words, without stopwords
    sw = stopwords.words('english')
    return text_series.apply(lambda row: [word for word in row.split() if word not in sw])
And to measure performance of the (mostly) vectorized approach:
In [6]:
start = datetime.now()
df['q1_tokenized'] = tokenize2(df['question1'])
print('Time elapsed: ', datetime.now() - start, '\n')
print(df[['question1', 'q1_tokenized']].head(3))
Time elapsed:  0:00:11.859043

(Again, the middle time of three runs from a shell session.)
Vectorizing our tokenizing function netted a >4X speedup. And just as importantly (to me anyway, in most cases): we didn't have to sacrifice code clarity to get the performance gain.
We got this speedup using just two built-in Series.str functions, even with a Series.apply() at the end of tokenize2 that I couldn't figure out quickly how to vectorize (though I bet there's a way to do it). And the code barely changed; to modify the function to accept a Series of strings instead of a string, I just changed:
text.lower() to text_series.str.lower(), and
re.sub(..., text) to text_series.str.replace(...)

This was a really nifty performance tip, especially considering how intuitive, and frankly idiomatic, it feels to use DataFrame.apply() in so many cases. To quote Tom again: "it's very natural to have to translate an equation to code and think, 'Ok now I need to apply this function to each row', so you reach for DataFrame.apply." But as this example shows, vectorizing your functions to accept a whole Pandas Series at a time and avoid .apply() pays large dividends.
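About that one remaining Series.apply() inside tokenize2: here's a rough sketch of how it might be vectorized too. I haven't benchmarked it against the timings above, and it assumes the Series has a unique index and no missing questions. The idea is to split each question into words, explode to one row per word with Series.explode (available in newer versions of pandas), filter the words with a vectorized isin(), and then group the survivors back into lists by the original index:

from nltk.corpus import stopwords

def tokenize3(text_series):
    ''' Rough sketch: same idea as tokenize2, but without the per-row list comprehension '''
    sw = set(stopwords.words('english'))
    cleaned = text_series.str.lower().str.replace(r'\W', ' ', regex=True)
    # one row per word; the original index value repeats for each word in a question
    words = cleaned.str.split().explode()
    # vectorized stopword filter
    words = words[~words.isin(sw)]
    # collect the surviving words back into one list per original row
    # (rows whose words were all stopwords come back as NaN here, rather than [])
    return words.groupby(level=0).agg(list).reindex(text_series.index)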